Fix to match GitHub’s algorithm on unicode #38
Merged
I reverse engineered GitHub’s slugging algorithm.
Somewhat based on #25 and #35.
To do that, I created two scripts:

- `generate-fixtures.mjs`, which generates a markdown file, in part from manual fixtures and in part from the Unicode General Categories, creates a gist, crawls the gist, removes it, and saves fixtures annotated with the expected result from GitHub
- `generate-regex.mjs`, which generates the regex that GitHub uses for characters to ignore

The regex is about 2.5kb minzipped. This increases the file size of this project a bit. But matching GitHub is worth it in my opinion.

I also investigated `\p{}` classes in `/u` regexes. They work mostly fine, with two caveats: a) they don’t work everywhere, so using them would be a major release, b) GitHub does not implement the same Unicode version as browsers. I tested with Unicode 13 and 14, and they include characters that GitHub handles differently.

In the end, GitHub’s algorithm is mostly fine: strip non-alphanumericals, allow `-`, and turn ` ` (space) into `-`.

Finally, I removed the trim functionality, because it is not implemented by GitHub. To assert this, make a heading like so in a readme: `# &#x20;`. This is a space encoded as a character reference, meaning that the markdown does not see it as the whitespace between the `#` and the content. In fact, this makes it the content. And GitHub creates a slug of `-` for it.

Further work: I think it would be nice to release this as is. Then, afterwards, I’d like to modernize the project, add GH Actions to generate the build, add types, and move to ESM.
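For readers who want a feel for the algorithm described above, here is a minimal sketch. It is an approximation under stated assumptions, not the code in this PR: the `ignore` regex uses `\p{}` classes as a stand-in for the generated regex (so, as noted above, it will differ from GitHub for some characters), the function name `approximateSlug` is hypothetical, and the lowercasing step is assumed rather than taken from this description.

```javascript
// Hypothetical approximation, NOT the regex produced by generate-regex.mjs.
// Keeps letters, marks, numbers, connector punctuation (such as `_`),
// hyphens, and spaces; every other character is removed.
const ignore = /[^\p{L}\p{M}\p{N}\p{Pc}\- ]/gu

// `approximateSlug` is an illustrative name, not part of this project's API.
function approximateSlug(value) {
  return value
    .toLowerCase() // Assumption: lowercasing is not mentioned in the description.
    .replace(ignore, '') // Strip characters to ignore.
    .replace(/ /g, '-') // Turn each space into a hyphen; note that nothing is trimmed.
}

console.log(approximateSlug('Hello, World!')) // hello-world
console.log(approximateSlug(' ')) // -
```

The second call mirrors the heading experiment above: because nothing is trimmed, a heading whose entire content is a single space yields `-`.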
/cc @Flet @jablko
Closes GH-22.
Closes GH-25.
Closes GH-35.